95 research outputs found

GOGGLES: Automatic Image Labeling with Affinity Coding

Author: Chaba Sanya
Chau Duen Horng
Chu Xu
Das Nilaksh
Gandhi Sakshi
Wu Renzhi
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 03/03/2020

Generating large labeled training data is becoming the biggest bottleneck in building and deploying supervised machine learning models. Recently, the data programming paradigm has been proposed to reduce the human cost in labeling training data. However, data programming relies on designing labeling functions, which still requires significant domain expertise. Also, it is prohibitively difficult to write labeling functions for image datasets as it is hard to express domain knowledge using raw features for images (pixels). We propose affinity coding, a new domain-agnostic paradigm for automated training data labeling. The core premise of affinity coding is that the affinity scores of instance pairs belonging to the same class on average should be higher than those of pairs belonging to different classes, according to some affinity functions. We build the GOGGLES system that implements affinity coding for labeling image datasets by designing a novel set of reusable affinity functions for images, and propose a novel hierarchical generative model for class inference using a small development set. We compare GOGGLES with existing data programming systems on 5 image labeling tasks from diverse domains. GOGGLES achieves labeling accuracies ranging from a minimum of 71% to a maximum of 98% without requiring any extensive human annotation. In terms of end-to-end performance, GOGGLES outperforms the state-of-the-art data programming system Snuba by 21% and a state-of-the-art few-shot learning technique by 5%, and is only 7% away from the fully supervised upper bound.

Comment: Published at 2020 ACM SIGMOD International Conference on Management of Data
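
The core premise stated in the abstract (same-class pairs should score higher on average than different-class pairs) can be checked on toy data. The following is only a minimal sketch: the cosine affinity, the two-dimensional vectors, and the helper names `mean_pair_affinity` and `cosine` are illustrative assumptions, not the affinity functions or the generative model used by GOGGLES.

```python
from itertools import combinations


def cosine(u, v):
    """Cosine similarity, used here as a stand-in affinity function (assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)


def mean_pair_affinity(vectors, labels, affinity):
    """Return (mean same-class affinity, mean different-class affinity)."""
    same, diff = [], []
    for i, j in combinations(range(len(vectors)), 2):
        score = affinity(vectors[i], vectors[j])
        # Route the pair score to the bucket matching its label agreement.
        (same if labels[i] == labels[j] else diff).append(score)
    return sum(same) / len(same), sum(diff) / len(diff)


# Hypothetical toy instances: two per class.
vectors = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]
labels = [0, 0, 1, 1]

same_mean, diff_mean = mean_pair_affinity(vectors, labels, cosine)
print(f"same-class mean: {same_mean:.3f}, different-class mean: {diff_mean:.3f}")
```

If the premise holds for a chosen affinity function, `same_mean` exceeds `diff_mean`; an affinity function for which the gap is near zero carries little class signal.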

arXiv.org e-Print Archive

Crossref

Rethinking Similarity Search: Embracing Smarter Mechanisms over Smarter Data

Author: Meng Jingfan
Rong Kexin
Wang Huayi
Wu Renzhi
Xu Jie Jeff
Publication date: 01/08/2023

In this vision paper, we propose a shift in perspective for improving the effectiveness of similarity search. Rather than focusing solely on enhancing the data quality, particularly machine learning-generated embeddings, we advocate for a more comprehensive approach that also enhances the underpinning search mechanisms. We highlight three novel avenues that call for a redefinition of the similarity search problem: exploiting implicit data structures and distributions, engaging users in an iterative feedback loop, and moving beyond a single query vector. These novel pathways have gained relevance in emerging applications such as large-scale language models, video clip retrieval, and data labeling. We discuss the corresponding research challenges posed by these new problem areas and share insights from our preliminary discoveries.

arXiv.org e-Print Archive

Nearest Neighbor Classifiers over Incomplete Information: From Certain Answers to Certain Predictions

Author: Chu Xu
Gürel Nezihe Merve
Karlaš Bojan
Li Peng
Wu Renzhi
Wu Wentao
Zhang Ce
Publication date: 12/05/2020

Machine learning (ML) applications have been thriving recently, largely attributed to the increasing availability of data. However, inconsistency and incomplete information are ubiquitous in real-world datasets, and their impact on ML applications remains elusive. In this paper, we present a formal study of this impact by extending the notion of Certain Answers for Codd tables, which has been explored by the database research community for decades, into the field of machine learning. Specifically, we focus on classification problems and propose the notion of "Certain Predictions" (CP) -- a test data example can be certainly predicted (CP'ed) if all possible classifiers trained on top of all possible worlds induced by the incompleteness of data would yield the same prediction. We study two fundamental CP queries: (Q1) checking query that determines whether a data example can be CP'ed; and (Q2) counting query that computes the number of classifiers that support a particular prediction (i.e., label). Given that general solutions to CP queries are, not surprisingly, hard without assumption over the type of classifier, we further present a case study in the context of nearest neighbor (NN) classifiers, where efficient solutions to CP queries can be developed -- we show that it is possible to answer both queries in linear or polynomial time over exponentially many possible worlds. We demonstrate one example use case of CP in the important application of "data cleaning for machine learning (DC for ML)." We show that our proposed CPClean approach built based on CP can often significantly outperform existing techniques in terms of classification accuracy with mild manual cleaning effort.
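
The two CP queries in the abstract can be illustrated by brute force on a tiny example. This sketch enumerates every possible world explicitly, which is exponential in the number of incomplete cells; it is not the linear or polynomial-time algorithm the paper describes. The one-dimensional features, the 1-NN rule with first-match tie-breaking, and the names `count_predictions` and `is_certain` are assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def count_predictions(incomplete_train, test_x):
    """Q2 (counting): number of possible worlds supporting each predicted label.

    incomplete_train is a list of (candidate_values, label); a row with a
    missing feature lists every value that feature might take.
    """
    counts = Counter()
    candidate_lists = [candidates for candidates, _ in incomplete_train]
    for world in product(*candidate_lists):
        # Train a 1-NN classifier on this world: pick the closest row.
        nearest_index = min(range(len(world)), key=lambda i: abs(world[i] - test_x))
        counts[incomplete_train[nearest_index][1]] += 1
    return counts


def is_certain(incomplete_train, test_x):
    """Q1 (checking): True if every possible world yields the same label."""
    return len(count_predictions(incomplete_train, test_x)) == 1


# Hypothetical data: the second row's feature is either 2.0 or 9.0.
train = [([1.0], "a"), ([2.0, 9.0], "a"), ([10.0], "b")]

print(is_certain(train, 1.5))         # every world predicts "a"
print(is_certain(train, 8.0))         # the worlds disagree
print(count_predictions(train, 8.0))  # one world per label
```

A test point that can be certainly predicted needs no cleaning of the incomplete rows to fix its label, which is the intuition behind using CP to prioritize manual cleaning effort.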

arXiv.org e-Print Archive

Repository for Publications and Research Data

Experiences and Lessons Learned from the SIGMOD Entity Resolution Programming Contests

Author: Bergamaschi Sonia
Chu Xu
De Angelis Andrea
Firmani Donatella
Li Peng
Mazzei Maurizio
Merialdo Paolo
Piai Federico
Simonini Giovanni
Wu Renzhi
Zecchini Luca
Publication date: 01/01/2023

We report our experience in running three editions (2020, 2021, 2022) of the SIGMOD programming contest, a well-known event for students to engage in solving exciting data management problems. During this period we had the opportunity of introducing participants to the entity resolution task, which is of paramount importance in the data integration community. We aim at sharing the executive decisions, made by the people co-authoring this report, and the lessons learned.

Archivio istituzionale della ricerca - Università di Modena e Reggio Emilia

FAST discovery of a fast neutral hydrogen outflow

Author: Aditya J. N. H. S.
Allison James R.
Curran S. J.
Gu Minfeng
Li Di
Mahony Elizabeth K.
Su Renzhi
Tang Ningyu
Wu Zhongzu
Yoon Hyein
Zheng Zheng
Zhu Ming
Publication date: 04/09/2023

In this letter, we report the discovery of a fast neutral hydrogen outflow in SDSS J145239.38+062738.0, a merging radio galaxy containing an optical type I active galactic nuclei (AGN). This discovery was made through observations conducted by the Five-hundred-meter Aperture Spherical radio Telescope (FAST) using redshifted 21-cm absorption. The outflow exhibits a blueshifted velocity likely up to $\sim -1000\,\rm km\,s^{-1}$ with respect to the systemic velocity of the host galaxy with an absorption strength of $\sim -0.6\,\rm mJy\,beam^{-1}$ corresponding to an optical depth of 0.002 at $v=-500\,\rm km\,s^{-1}$. The mass outflow rate ranges between $2.8\times10^{-2}$ and $3.6\,\rm M_\odot\,yr^{-1}$, implying an energy outflow rate ranging between $4.2\times10^{39}$ and $9.7\times10^{40}\,\rm erg\,s^{-1}$, assuming 100 K $< T_{\rm s} <$ 1000 K. Plausible drivers of the outflow include the star bursts, the AGN radiation, and the radio jet, the last of which is considered the most likely culprit according to the kinematics. By analysing the properties of the outflow, the AGN, and the jet, we find that if the HI outflow is driven by the AGN radiation, the AGN radiation seems not powerful enough to provide negative feedback whereas the radio jet shows the potential to provide negative feedback. Our observations contribute another example of a fast outflow detected in neutral hydrogen, as well as demonstrate the capability of FAST in detecting such outflows.

Comment: Accepted by ApJ

arXiv.org e-Print Archive

Does a radio jet drive the massive multi-phase outflow in the ultra-luminous infrared galaxy IRAS 10565+2448?

Author: Aditya J. N. H. S.
Allison James R.
Chandola Yogesh
Chen Yongjun
Curran S. J.
Glowacki Marcin
Gu Minfeng
Liu Xiang
Mahony Elizabeth K.
Moss Vanessa A.
Sadler Elaine M.
Shao Xi
Su Renzhi
Weng Simon
Whiting Matthew T.
Wu Zhongzu
Yoon Hyein
Publication venue: 'Oxford University Press (OUP)'
Publication date: 02/02/2023

We present new upgraded Giant Metrewave Radio Telescope (uGMRT) HI 21-cm observations of the ultra-luminous infrared galaxy IRAS 10565+2448, previously reported to show blueshifted, broad, and shallow HI absorption indicating an outflow. Our higher spatial resolution observations have localised this blueshifted outflow, which is $\sim 1.36$ kpc southwest of the radio centre and has a blueshifted velocity of $\sim 148\,\rm km\,s^{-1}$ and a full width at half maximum (FWHM) of $\sim 581\,\rm km\,s^{-1}$. The spatial extent and kinematic properties of the HI outflow are consistent with the previously detected cold molecular outflows in IRAS 10565+2448, suggesting that they likely have the same driving mechanism and are tracing the same outflow. By combining the multi-phase gas observations, we estimate a total outflowing mass rate of at least $140\,\rm M_\odot\,yr^{-1}$ and a total energy loss rate of at least $8.9\times10^{42}\,\rm erg\,s^{-1}$, where the contribution from the ionised outflow is negligible, emphasising the importance of including both cold neutral and molecular gas when quantifying the impact of outflows. We present evidence of the presence of a radio jet and argue that this may play a role in driving the observed outflows. The modest radio luminosity $L_{\rm 1.4GHz} \sim 1.3\times10^{23}\,{\rm W\,Hz^{-1}}$ of the jet in IRAS 10565+2448 implies that the jet contribution to driving outflows should not be ignored in low radio luminosity AGN.

Comment: 12 pages, 9 figures, accepted for publication in MNRAS

arXiv.org e-Print Archive

Theoretical Investigations into Self-Organized Ordered Metallic Semi-Clusters Arrays on Metallic Substrate

Author: B Diaconescu
C Didiot
D Vanderbilt
G Kresse
H Brune
Han-Yue Zhao
HJ Monkhorst
JP Perdew
JV Barth
K Wu
M Renzhi
MC Payne
ME Gonzalez-Mendez
Nan-Xian Chen
NX Chen
OL Alerhand
P Monachesi
R Gastel van
SC Li
VI Marchenko
XC Wang
Xiao-Chun Wang
Y Long
Y Yafet
Yong Zhang
Publication venue: Springer
Publication date: 01/01/2010

Using energy minimization calculations based on an interfacial potential and a first-principles total energy method, respectively, we show that the (2 × 2)/(3 × 3) Pb/Cu(111) system is a stable structure among all the [(n − 1) × (n − 1)]/(n × n) Pb/Cu(111) (n = 2, 3,…, 12) structures. The electronic structure calculations indicate that self-organized ordered Pb semi-cluster arrays are formed on the first Pb monolayer of (2 × 2)/(3 × 3) Pb/Cu(111), which is due to a strain-release effect induced by the inherent misfits. The Pb semi-cluster structure can generate selective adsorption of atoms of semiconductor materials (e.g., Ge) around the semi-clusters and can therefore be used as a template for the growth of nanoscale structures with a very short periodic length (7.67 Å).

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central